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Abstract 

Even though Darwin's "On the Origin of Species" implied selection being the main driver of species formation, the role of 
natural selection in speciation remains poorly understood. In particular, it remains unclear how selection at a few genes 
can lead to genomewide divergence and the formation of distinct species. We used a particularly attractive clear-cut case 
of recent plant ecological speciation to investigate the demography and genomic bases of species formation driven by 
adaptation to contrasting conditions. High-altitude Senecio aethnensis and low-altitude S. chrysanthemifolius live at the 
extremes of a mountain slope on Mt. Etna, Sicily, and form a hybrid zone at intermediate altitudes but remain mor- 
phologically distinct. Genetic differentiation of these species was analyzed at the DNA polymorphism and gene expression 
levels by high-throughput sequencing of transcriptomes from multiple individuals. Out of ~ 18,000 genes analyzed, only a 
small number (90) displayed differential expression between the two species. These genes showed significantly elevated 
species differentiation (F ST and D xy ), consistent with diversifying selection acting on these genes. Genomewide genetic 
differentiation of the species is surprisingly low (F ST = 0.19), while -200 genes showed significantly higher (false discovery 
rate < 1%; mean outlier F ST > 0.6) interspecific differentiation and evidence for local adaptation. Diversifying selection at 
only a handful of loci may be enough for the formation and maintenance of taxonomically well-defined species, despite 
ongoing gene flow. This provides an explanation of why many closely related species (in plants, in particular) remain 
phenotypically and ecologically distinct despite ongoing hybridization, a question that has long puzzled naturalists and 
geneticists alike. 
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Introduction 

Understanding how species arise is a long-standing goal of 
evolutionary biology. Contrary to Darwin's view of the Origin 
of Species by Means of Natural Selection (Darwin 1859), spe- 
ciation does not always occur due to natural selection, as 
species may arise via gradual accumulation of differences in 
isolated populations or via saltational change in chromosome 
number/structure causing immediate reproductive isolation 
between a pair of neospecies. In fact, this nonselectionist view 
of the speciation process dominated evolutionary biology 
over a significant proportion of last century (Mayr 1942). 
More recently, the evidence in favor of selection-based ecol- 
ogy-driven speciation has started to accumulate (Schluter 
2009; Nosil 2012), and the focus of speciation studies has 
shifted to the analysis of closely related species that are still 
exchanging genes (Wolf et al. 2010; Smadja and Butlin 2011; 
Feder et al. 2012). Such intermediate stages of the speciation 
process have the potential to shed light on the very moment 
of speciation (as opposed to later stages of gradual diver- 
gence) and to reveal the relative importance of a range of 
factors in the development of reproductive isolation, 



including locus-specific and genomewide divergent selection, 
and genetic drift (Schluter 2009; Nachman and Payseur 2012; 
Nosil 2012). 

Ecological speciation gives rise to populations that are 
locally adapted such that if an individual from one population 
finds itself in the habitat of the other then it is less fit 
("immigrant inviability"; Nosil et al. 2005). At the genetic 
level, the alleles for local adaptation are unfit in the alternate 
environment and therefore remain solely or mostly within the 
species in which they originated. Other loci, which do not 
affect ecological differences, are not subject to such strong 
diversifying selection and therefore, at least in the early stages 
of speciation, will not show such strong interspecific differen- 
tiation, creating a heterogeneous pattern of divergence 
throughout the genome and so-called speciation islands 
(Wu 2001; Nosil et al. 2009). 

This mosaic of divergence between species has been un- 
covered in several species (e.g., Rieseberg et al. 1999; Scotti- 
Saintagne et al. 2004; Lawniczak et al. 2010; reviewed in Nosil 
et al. 2009) and may be particularly prominent if gene flow 
between the species is ongoing. In such situations, most 
notably hybrid zones, hybridization allows alleles at neutral 
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loci to introgress and recombine, whereas loci under diver- 
gent selection will be slowed or even prevented from intro- 
gressing (e.g., Martinsen et al. 2001; Payseur and Nachman 
2005; Bull et al. 2006; Minder and Widmer 2008; Kane et al. 
2009; Kulathinal et al. 2009; Stolting et al. 2013). Loci with 
patterns of differentiation that exceed the neutral expecta- 
tion are candidates for underlying interspecific differences 
(Beaumont and Balding 2004; Storz 2005). Targets of selection 
may well be surrounded by an extended region of elevated 
differentiation, leading to the possibility that markers close to 
the target also show the signature of selection (Feder et al. 
2012). 

Altitudinal dines and hybrid zones present particularly 
convenient models to study adaptation and speciation 
because conditions and selective pressures change quite dra- 
matically over fairly short distances (Payseur 2010). This study 
focuses on a plant hybrid zone on the slopes of Mt. Etna, 
Sicily, where two closely related species of ragwort (Senecio, 
Asteraceae), adapted to high- and low-altitude environments, 
hybridize at intermediate altitudes, but remain phenotypically 
distinct. High-altitude Senecio aethnensis Jan ex DC. is en- 
demic to Mt. Etna, and is restricted to altitudes above 
~ 2,000 m. Low-altitude S. chrysanthemifolius Poir. is found 
on the lower slopes of Mt. Etna (typically below ~1,000m) 
and throughout Sicily, even reaching the southern tip of 
mainland Italy. Morphologically variable hybrids are found 
at intermediate altitudes on Mt. Etna (James and Abbott 
2005; Brennan et al. 2009), and for traits that differentiate 
the parental species (most conspicuously, leaf dissection, in- 
florescence size, and flowering time), dines are found as one 
moves from one end of an altitudinal transect to the other 
(Crisp 1972). Analysis of putatively neutral molecular markers 
also demonstrates ongoing hybridization and backcrossing, 
with markers apparently able to introgress with relative 
ease between the species (James and Abbott 2005; Brennan 
et al. 2009). These findings are in line with the observation 
that the species are completely interfertile (Chapman et al. 
2005; Brennan et al. 2013) and hybrids are typically vigorous 
and fertile in the greenhouse (Hegarty et al. 2009; Brennan 
et al. 2013). 

Comparisons of molecular and phenotypic dines in the 
hybrid zone suggest that the environment is playing a role in 
determining trait differentiation across the hybrid zone, with 
selection against hybrid genotypes playing a role at more local 
spatial scales (Brennan et al. 2009). A nuclear gene phylogeny 
suggests that these are sister species within a group of closely 
related Mediterranean Senecios (Chapman MA and Filatov 
DA, in preparation), and they diverged very recently, proba- 
bly within the last 150,000 years (11-72 thousand years ago 
[Ka] [Muir et al. 2013] or 130-176 Ka [Osborne et al. 2013]). 
Divergence with gene flow was supported (over a model 
without gene flow) by analysis of two transcriptome se- 
quences (Osborne et al. 2013). Migration rates between the 
species are also high (Muir et al. 2013), indicating that drift 
alone is not strong enough to create this pattern of interspe- 
cific divergence. Coupled with the extreme environmental 
differences between the high- and low-altitude sites, these 
species are likely candidates for having arisen via ecological 



speciation, that is, ecologically based divergent selection was a 
driving force in the speciation process (Nosil 2012). 

Given that differential expression can evolve rapidly be- 
tween closely related species (Whitehead and Crawford 2006; 
Pavey et al. 2010; Wolf et al. 2010) and can underlie species 
differences (Gompel et al. 2005; Abzhanov et al. 2006; Wray 
2007), expression divergence could play a central role in de- 
termining phenotypic differences observed between recently 
derived species, even when coding sequence divergence is not 
apparent. In closely related species that are still exchanging 
genes, such as here, fixed coding differences are likely to be 
rare, and instead, as a complement to the expression analysis, 
population genetic tests can be applied to sequence data to 
identify loci that show signatures of divergent selection. 
To study the genomic basis of adaptation during ecological 
speciation in Senecio, we therefore used a next-generation 
sequencing (NGS) approach to 1) quantify gene expression 
of multiple individuals of each of the species and 2) analyze 
DNA polymorphism throughout the transcriptomes and 
compare differentially and nondifferentially expressed genes. 
We demonstrate that a large proportion of the Senecio 
genome is apparently able to introgress, resulting in low av- 
erage interspecific differentiation, and we report evidence for 
differential expression and elevated differentiation in a subset 
of loci that are likely to be involved in adaptation to opposing 
selective pressures at high and low altitudes. 

Results 

Patterns of DNA Polymorphism 

Assembly of the Senecio transcriptome resulted in 18,797 
contigs of >500bp. Based on BlastX comparison to the 
Arabidopsis proteome, more than half of these contigs were 
>50% of the full length, and almost 30% were predicted as 
being approximately (>90%) full length (supplementary fig. 
S1, Supplementary Material online). After mapping reads 
from the 20 individuals (table 1) to the reference tran- 
scriptome and excluding loci for which more than 6 of the 
20 individuals were absent, we were left with 17,394 loci, 
totaling 22.63 MB. The alignments for the majority of these 
loci (16,052; 92.3%) contained portions of all 40 alleles se- 
quenced from all 20 individuals. 

Before excluding sites with missing data, 17,250 loci were 
variable in the total (S. aethnensis plus S. chrysanthemifolius) 
data set, while 16,920 and 16,608 loci were polymorphic 
within S. aethnensis and S. chrysanthemifolius, respectively. 
In total, we identified 354,815 single nucleotide polymor- 
phisms (SNPs), while species-specific data sets included 
274,002 and 216,999 SNPs within S. aethnensis and S. chry- 
santhemifolius, respectively (one SNP on average every 83 
and 104 bp; table 2). The overall level of sequence diversity 
across both species pooled (7T both ) was 0.00361 ± 0.0038 
(standard deviation [SD]), but differed significantly between 
the two species (7T aet = 0.00365 ± 0.0036 [SD], 7T chr = 0.00290 ± 
0.0035, paired t-test; t = 23.52; P= 1.8 x 10~ 119 ; table 2 and 
fig. 1). 

Excluding sites with missing data reduced the total 
number of sites in the entire aligned transcriptome to 6.13 
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Table 1. Locations of Senecio Populations from Which Achenes Were Collected for This Investigation. 


Code 


Species 


Collection Site 


Latitude 


Longitude 


Altitude (m) 


n 


BRO 


Senecio chrysanthemifolius 


Bronte, Sicily 


37.47° N 


14.50°E 


870 


2 


LIN 


S. chrysanthemifolius 


Linguaglossa, Sicily 


37.50° N 


15.07°E 


750 


2 


NIC 


S. chrysanthemifolius 


Nicolosi, Sicily 


37.37° N 


15.1°E 


850 


4 


RAN 


S. chrysanthemifolius 


Randazzo, Sicily 


37.57° N 


14.57°E 


763 


2 


PROI 


S. aethnensis 


Piano Provenzana, Sicily 


37.47° N 


15.1°E 


2036 


2 


PR02 


S. aethnensis 


Piano Provenzana, Sicily 


37.47° N 


15.1°E 


2036 


3 


ET1 


S. aethnensis 


Rifugio Piccolo, Sicily 


37.42° N 


14.59°E 


2097 


4 


ET3 


S. aethnensis 


Rifugio Piccolo, Sicily 


37.43° N 


14.59°E 


2471 


1 


GAL 


S. gallicus 


Amoreira, Algarve, Portugal 


37.21 °N 


8.50°W 


<50 


1 


VER 


S. vernalis 


Cyprus 


Unknown 


Unknown 


Unknown 


1 



Note. — Seed were collected by Adrian Brennan, with the exception of S. gallicus, collected by SJH, and S. vernalis which was donated by the Millennium Seed Bank Partnership. 
"n," number of individuals sequenced. 



Table 2. Summary Statistics for the Assembled Transcriptomes. 

Combined S. aethnensis S. chrysanthemifolius 

Full data set a 



Number of loci 


17,394 


17,325 


17,331 


Number of monomorphic loci 


144 


474 


786 


Number of alignment positions 


22,595,568 


22,601,091 


22,604,411 


Number of SNPs 


354,815 


274,002 


216,999 


7i±SD 


0.00361 ±0.0038 


0.00365 ±0.0036 


0.00290 ± 0.0035 


Tajima's D±SD 


0.039 ±0.98 


-0.040 ± 0.92 


-0.046 ±1.05 


Reduced data set b 








Number of loci 


8,854 


10,839 


9,821 


Number of monomorphic loci 


314 


456 


854 


Number of alignment positions 


6,131,166 


8,773,007 


7,054,456 


Number of SNPs 


87,766 


107,020 


67,541 


7T±SD 


0.00361 ±0.0026 


0.00367 ±0.0026 


0.00286 ± 0.0025 


Tajima's D±SD 


0.052 ±0.98 


-0.036 ±0.92 


-0.044 ±1.06 



a AII loci >500bp in length and with sequences from >70% of individuals. 

includes only loci with >100 gap-free alignment positions after removing sites with missing data. 



0.20 




Fig. 1. Sequence diversity of the S. aethnensis and S. chrysanthemifolius transcriptomes. The distribution of sequence diversity (tt, all sites) across the 
transcriptomes of Senecio aethnensis (blue) and S. chrysanthemifolius (red). 



MB and the total number of SNPs to 87,766 (table 2). Taking 
each species separately, this reduced the number of SNPs 
from 274,002 to 107,020 within S. aethnensis and from 
216,999 to 67,541 within S. chrysanthemifolius (table 2). 
Subsequent exclusion of loci with fewer than 100 gap-free 
alignment positions reduced the total data set to 8,854 



alignments for the combined data set, and sequence diversity 
was again significantly different between the species (7T aet = 
0.00367 ± 0.0026 [SD], 7T chr = 0.00286 ± 0.0025, paired t-test; 
t = 34.19; P = 2.4x 10~ 236 ; table 2). The same pattern was 
also found for different types of sites (noncoding, synony- 
mous etc.; paired t-test; all P< 10 -5 ; fig. 2). Within species, 
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the same cutoff reduced the numbers of data sets to 10,839 
for S. aethnensis and 9,821 for S. chrysanthemifolius. In line 
with the above observations of reduced polymorphism in S. 
chrysanthemifolius, the proportion of monomorphic loci was 
twice as high in this species (854; 8.70% of 9,821 loci) com- 
pared with S. aethnensis (456; 4.20% of 10,839 loci). 

Across both species combined as well as within each spe- 
cies, Tajima's D was close to zero (D both = 0.052 ± 0.98; 
D aet = -0.036 ± 0.92, D chr = -0.044 ± 1.06). Although the dis- 
tribution was not significantly different between S. aethnensis 
and S. chrysanthemifolius (paired t-test; t- 1.575; P>0.1), 
S. chrysanthemifolius showed a broader spread of values 
(table 2 and fig. 3). 

Linkage Disequilibrium 

One caveat of carrying out simultaneous analysis of an entire 
transcriptome is that loci residing in the same region of the 
genome may be nonindependent in an evolutionary sense 
due to linkage. Positive selection at one site, for example, can 
affect frequencies of alleles at any sites that are in linkage 
disequilibrium (LD) (e.g., Maynard-Smith and Haigh 1974). 
To determine whether nearby SNPs were likely to be in LD 
with each other, we inferred the distance at which LD breaks 
down within the loci. Despite this analysis being based on 
intronless sequences and therefore the distances we used are 
likely to be underestimated in many cases, this is unlikely to 
be a very large difference as total intron length per gene in 
plants is on average <2 kb (Carels and Bernardi 2000; Hong 
et al. 2006; Jaillon et al. 2007; You et al. 2009). Above 7.5 kb, 
there was no evidence for pairs of SNPs to demonstrate LD; 
therefore, LD is only likely to affect SNPs within loci (supple- 
mentary fig. S2, Supplementary Material online), and thus the 
vast majority of loci are likely to be independent in evolution- 
ary sense. 

Genetic Differentiation and Demographic Inference 
Almost half of the SNPs (40,094; 45.7%) were shared between 
the two species, and only 98 fixed differences, in only 57 loci, 



were observed. Overall, genetic differentiation of S. aethnensis 
and S. chrysanthemifolius is quite low (for a between-species 
comparison), with mean and median F ST values of 0.196 
(±0.191 [SD]) and 0.146, respectively (fig. 4). Nevertheless, 
the STRUCTURE analysis revealed K=2 as the most likely 
number of clusters, clearly subdividing the 20 individuals into 
their specific groups, with only two individuals showing any 
evidence of admixture (fig. 5). 

To infer past demography of S. aethnensis and S. chry- 
santhemifolius, we fit population split with migration 
models to the data using dadi (Gutenkunst et al. 2009). 
This approach is based on comparison of the observed and 
modeled site frequency spectra (SFS) (fig. 6). To distinguish 
the ancestral and derived alleles for each SNP, we used tran- 
scriptome sequences from two outgroup species (see 
Materials and Methods). To minimize the effects of selection 
on our demographic inference, only SNPs at 4-fold degenerate 
sites were used (39,635 SNPs from 4,610 loci that had at least 
500 alignment positions with no missing data). We tried sev- 
eral models implemented in dadi and chose the model with 
the least number of parameters that was appropriate for our 
data; the isolation-with-migration model with population size 
change (IMpre). In this model, an ancestral population of size 
N a undergoes an instantaneous size change to size N b at time 
T b . Then at time T s it splits in two with size s x N b and 
(1 — s) x N b (where s is the fraction of N b that forms popu- 
lation 1). Each is then allowed to exponentially change in size 
to the current effective population sizes (N-i and N 2 ) and 
exchange migrants at rates A/1-|^ 2 ar| d M 2 ^i (fig. 7). 

The above parameters in the model (except s) are scaled in 
units of 2N a , and thus one has to assume some value for the 
ancestral population size to convert into more conventional 
units. Using different approaches, two previous studies esti- 
mated N a to be 227,837 (range ~ 129,000-475,000; Muir et al. 
2013) and 312,023 (±9,833 [standard error]; Osborne et al. 
2013). Assuming N a = 300,000 (see Discussion for validation) 
and a generation time of 2 years, the time of species split is 
estimated to have occurred 107,571 years ago (bootstrap 
range [BR]: 52,853-186,785) and the contemporary 
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Fig. 2. Sequence diversity is significantly lower in S. chrysanthemifolius relative to S. aethnensis. Mean (±SD) sequence diversity per locus across all sites 
(All) and across groups of sites (1st and 2nd coding positions, 4-fold degenerate coding positions and noncoding positions) is given. For each test S. 
chrysanthemifolius was lower than S. aethnensis (paired t-test; all P < 0.0001). 
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Fig. 3. Tajima's D across the transcriptomes of S. chrysanthemifolius shows a broader spread of values than in S. aethnensis. The proportion of loci that 
fall within each bin is given per species. Mean Tajima's D was not significantly different between the species (paired t-test; P> 0.1). 
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Fig. 4. F ST is strongly skewed toward low values across all loci; however expression outliers showed a greater proportion of loci with high F ST . The 
expression outliers with sufficient sequence information (n - 26 loci; see text) are plotted in green and nonoutliers in black. 
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Fig. 5. Senecio aethnensis and S. chrysanthemifolius are genetically differentiated. STRUCTURE was used to determine that the most likely number of 
genetic clusters was 2 (blue and red). All S. chrysanthemifolius individuals ("c_xxx") fall into cluster 2 with >99.9% ancestry; however, for two S. 
aethnensis individuals, ("a_xxx") slight (<20%) mixed ancestry was uncovered. Individuals are named according to Table 1. 



population sizes of S. aethnensis and S. chrysanthemifolius are 
54,027 (BR: 26,448-89,856) and 39,846 (BR: 19,987-63,642), 
respectively (fig. 7). There appears to be a slight overall reduc- 
tion of population size compared with the ancestral species, 



which was also reported by Muir et al (2013). This is due to 
the reduction in population size of S. aethnensis because S. 
chrysanthemifolius has apparently shown an increase in pop- 
ulation size recently (fig. 7). Interestingly, the IMpre model 
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Fig. 6. The fit of the IMpre demographic model to the data summarized by the observed and modeled 2D-SFS. (A and B) Observed (A) and modeled 
(B) 2-D SFS for S. aethnensis (Y axis) and S. chrysanthemifolius (X axis). The scale on the right indicates the number of polymorphisms in each frequency 
class (squares of the grid). (C) Anscombe residuals between the model and the data where blue and red (negative and positive values) indicate the 
model predicts too few and too many polymorphisms in a given grid square, respectively. (D) The distribution of the residuals. 



indicates a very uneven (94% to 6%) split of the ancestral 
population at time T s into S. aethnensis and S. chrysanthemi- 
folius (fig. 7), which suggests that the latter originally was 
founded as a small population at the margin of the species 
range. 

Gene flow from S. aethnensis to S. chrysanthemifolius 
(M 2 ^ =9.64 [BR: 5.35-16.81]) appears to be higher than in 
the opposite direction (M^ 2 = 4.94 [BR: 2.70-9.09]). Overall, 
migration rate in both directions is quite high and is signifi- 
cantly different from zero; forcing the migration parameters 
to equal zero dramatically reduces the likelihood of the 
model (model with bidirectional migration log likelihood 
[InL] = —2,157.64; models with migration in only one direc- 
tion ln/_ = —13,756.64 and —3,533.61; model with no migra- 
tion InL = —21,738.14; likelihood ratio test for all comparisons 
P < 0.001). This high migration rate is not unexpected, given 
the field observations of extensive hybrid swarms at interme- 
diate altitudes (e.g., Ross 2010 and personal observations). 

Global Selective Sweeps 

A mutation that is adaptive across the entire range of pop- 
ulations interconnected by gene flow is expected to spread 
throughout the species range, even if gene flow between 
populations (or species/subspecies/races) is very low 
(Slatkin and Wiehe 1998). Unlike diversifying selection and 
local adaptation, "global sweeps" are expected to eliminate 
genetic differentiation between the populations or species. 
Such sweeps are expected to leave distinctive footprints in 



DNA polymorphism, such as reduced variation, biased fre- 
quency spectra, and elevated LD across the entire range of 
hybridizing populations (Oleksyk et al. 2010). 

Comparative analyses of the patterns of polymorphism 
and deviations from neutral allele frequency spectra, as well 
as explicit tests for nonneutrality, are therefore instructive in 
this regard. First, we identified 314 loci (3.5% of 8,854) that 
contained no polymorphism in the sequences from both spe- 
cies. To test whether low diversity in these loci is incompatible 
with neutrality, we used a maximum likelihood version of the 
Hudson-Kretiman-Aguade (HKA; Hudson et al. 1987) test 
(Wright and Charlesworth 2004), integrated in Proseq 
(Filatov 2009). Maximum likelihood (ML)-HKA tests rejected 
neutrality (P < 0.05) for 22 of the loci (7.0%). Second, we 
investigated the patterns of polymorphism in the (nonmono- 
morphic) loci in the lowest 5% tail of the distribution of av- 
erage heterozygosity (n). Average n in this subset of loci was 
only 0.00034 ± 0.00015 (SD). Eighty eight (>20%) of these had 
among the lowest (most negative 5%) Tajima's D, suggestive 
of recent selection, and eight were in the lowest 5% of Fay and 
Wu's H distribution, specifically indicative of recent (i.e., <0.1 
Ne; Przeworski 2002) positive selection at these loci. Taken 
together, these analyses suggest that very low or zero poly- 
morphism in a subset of loci is due to recent selection. 

Diversifying Selection 

Diversifying selection at a locus is expected to increase differ- 
entiation and inflate F ST . This signal is used by the BayeScan 
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Fig. 7. Graphical summary of the most likely inferred demographic scenario of speciation. This scenario was inferred with the IMpre model imple- 
mented in dadi. The parameters are described in the text. Time (X axis) and each population size (Y axis) are drawn to scale based on the mean 
parameter estimates. (Inset) Box and whisker plots of the parameter estimates from 300 bootstrap replicates. Boxes indicate the 95% confidence 
intervals, whiskers indicate the range, and the dashed line indicates the mean estimate. Note the log 10 scale for the parameters T b and T s . 



software (Foil and Gaggiotti 2008; Fischer et al. 2011), which 
identified 199 positive outlier loci (i.e., greater than expected 
differentiation) with 1% false discovery rate (FDR). Although 
BayeScan does not simply identify the loci with the highest 
F ST , average F ST for these positive outliers was 0.657 ±0.193 
(SD), significantly greater than the remainder of the loci 
(Mann- Whitney U = 8.15 x 10 4 , z= -21.77, P = 4.02 x 
10 -105 ). Average divergence, D xy , was also significantly 
higher in the BayeScan outliers (0.0044 ± 0.0026 [SD] 
vs. 0.0034 ± 0.0022; Mann- Whitney U = 5.876 x 10 5 , 
z = —6.278, P = 3.43 x 10 -10 ) suggestive of selection driving 
divergence in this subset of loci. Out of 98 SNPs fixed between 
the two species, 49 were found in the 199 BayeScan outlier 
loci. 

An additional 369 loci were deemed negative outliers, 
which may contain a subset of loci with the signature of 
balancing selection. These loci had a significantly lower F ST 
than the remainder of the loci (0.099 ±0.092 [SD] vs. 
0.200 ±0.193; Mann-Whitney U= 1.059 x 10 6 , z= -9.678, 
P = 3.74x 10~ 22 ). 



Within the positive outliers are a number of loci with 
specific signatures of diversifying selection or species-specific 
adaptation. Average heterozygosity (n) for the BayeScan out- 
liers was significantly lower than the nonoutlier loci in 
S. chrysanthemifolius (0.00060 ± 0.0009 [SD] vs. 0.00290 ± 
0.0025; Mann-Whitney U = 2.572 x 10 5 , z = -16.94, 
P = 2.10 x 10 -64 ) and marginally so in S. aethnensis 
(tt = 0.00318 ±0.0022 [SD] vs. 0.00368 ± 0.0026; Mann- 
Whitney U = 7.669 x 10 5 , z = -2.643, P = 0.0082). This reduc- 
tion in the level of polymorphism (when each species was 
compared with that expected from divergence from out- 
group S. vemaWs) was significant in 35 loci (ML-HKA test, 
P<0.05): 21 in S. chrysanthemifolius, 11 in S. aethnensis, 
and 3 loci in both species. Moreover, a complete absence of 
polymorphism was found for 68 of the 199 loci (34.2%) in S. 
chrysanthemifolius, significantly more than expected from the 
genomewide pattern (Fisher's exact test P = 8.690 x 10 -11 ), 
while only eight (4.0%) were monomorphic in S. aethnensis 
(Fisher's exact test P = 0.325). Three of the 199 outlier loci 
were monomorphic in both species (with one to three 
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SNPs differentiating the alleles). The loci showing elevated 
differentiation coupled with reduced polymorphism in one 
or both species represent some of the strongest candidates 
for diversifying selection or parallel adaptation. 

Species Differentiation in Gene Expression 
Expression (reads per kilobase of exon model per million 
mapped reads [RPKM]) was calculated for each of 18,797 
contigs in 20 separate individuals, and t-tests were used to 
identify significantly different expression between species (see 
Materials and Methods; supplementary table S1, Supplemen- 
tary Material online). With an FDR value of 5%, significantly 
different expression was detected for 64 contigs, with 29 and 
35 being expressed at a significantly higher level in S. aeth- 
nensis and S. chrysanthemifolius, respectively. The NOISeq test 
(Tarazona et al. 2011) for differential expression revealed a 
smaller number of gene candidates (32), with 9 expressed at a 
higher level in S. aethnensis and 23 higher in S. chrysanthemi- 
folius. Six loci were detected as showing significant expression 
differences in both the t-test and using NOISeq, giving a total 
of 90 expression candidates. The fold difference of the expres- 
sion candidates between S. aethnensis and S. chrysanthemifo- 
lius was sometimes extreme, >100x in 11 cases (see 
supplementary table S1, Supplementary Material online). 

There was little overlap for the differentially expressed loci 
identified by this study and the previous microarray investi- 
gation (Hegarty et al. 2009); of the 100 differentially expressed 
transcripts identified by Hegarty et al. that were identified in 
our transcriptome, only two (see supplementary table S1, 
Supplementary Material online) were differentially expressed 
in our RNA-Seq investigation. The Hegarty et al. (2009) inves- 
tigation was based on a different tissue mix (two different bud 
stages only) and pooled samples from the different taxa, and 
hence represents more individuals per species but in the ab- 
sence of biological replicates (see Discussion). Thus, the two 
analyses are not strictly comparable. 

The genes differentially expressed in S. aethnensis and 
S. chrysanthemifolius showed evidence for higher genetic dif- 
ferentiation at the sequence level relative to the remainder of 
the loci. Initially, divergence estimates (F ST and D xy ) were 
available for only 17 of the expression candidates because 
these differentially expressed genes often have very low cov- 
erage (and hence a lot of missing data) in one of the species. 
We therefore polymerase chain reaction (PCR) amplified and 
Sanger-sequenced a further nine expression outlier loci from 
genomic DNA from the same individuals. F ST between S. 
aethnensis and S. chrysanthemifolius was significantly greater 
for the 26 expression outliers than the nonoutliers 
(0.348 ±0.226 [SD] vs. 0.195 ±0.191; Mann-Whitney 
U = 6.594 x 10 4 , z=-3.57, P = 0.0004; fig. 4). This was also 
the case when comparing just the initial 17 loci to the 
nonoutliers (F ST = 0.344 ± 0.242; P = 0.0081) as well as the 
nine loci amplified by PCR (F ST = 0.355 ± 0.205; P = 0.0160). 
Similarly, D xy , the average number of pairwise differences be- 
tween the alleles from the two species, was significantly higher 
for the differentially expressed loci (0.0070 ± 0.0051 [SD] vs. 



0.0035 ± 0.0022; Mann-Whitney U = 6.015 x 10 4 , z = -4.031, 
P = 5.547 x 10~ 5 ). 

Despite this trend, only one of the expression candidates 
(a DNAJ heat-shock protein homologue) was also a positive 
BayeScan-based candidate (i.e., outlier with increased inter- 
specific divergence); however, only 17 of the expression out- 
liers had sufficient sequence information to be included in the 
BayeScan analysis. This and a further three of the 17 expres- 
sion candidates exhibited an F ST in the top 5% of the tran- 
scriptomic distribution, comprising a serine/threonine kinase, 
a methyltransferase, and a pre-mRNA-splicing factor. 

Across both species combined, sequence diversity (n) for 
the expression outliers was higher than for the nonoutliers 
(Mann-Whitney U = 5.476 x 10~ 4 , z= -2.29, P = 0.02223), 
but not so for the within species diversity (all P>0.1). 
Tajima's D was also not significantly different for the expres- 
sion outliers (all P > 0.1 ); however, it should be noted that this 
is based on a small number of outliers. 

In addition, ML-HKA tests were conducted for each species 
separately for the expression candidates that also had a 
S. vernalis sequence. For the species-specific samples, the 
number of differentially expressed loci with >100 gap-free 
sites increased (compared with the two species sample) to 
19 for S. aethnensis and 30 for S. chrysanthemifolius. For eight 
loci, the ML-HKA was significant (three loci in S. aethnensis, 
four loci in S. chrysanthemifolius, and one locus in both spe- 
cies), suggesting that directional selection for differential 
expression had taken place in these species at these loci. 

Types of Loci under Selection 

Analysis of gene ontology (GO) terms revealed significant 
(P = 0.00048) overrepresentation of genes involved in treha- 
lose biosynthesis among the positive BayeScan outliers, i.e., 
those showing the signature of adaptive differentiation. 
Trehalose is a disaccharide that plays a role in desiccation 
tolerance in bacteria and yeast (Crowe et al. 1992), and, al- 
though trehalose is not thought to accumulate in the major- 
ity of plants (the exception being "resurrection plants" (e.g., 
Vicre et al. 2004), it may be that genes involved in trehalose 
biosynthesis are under selection at the sequence level due to 
differing drought conditions of the two habitats. Using an 
FDR of 5%, this GO term was, however, not significant. 

GO annotation of the expression candidates revealed an 
overrepresentation of genes involved in primary amine oxi- 
dase activity (P = 0.00057) and magnesium ion binding 
(P = 0.00680), but these again were not significant when a 
5% FDR was used. The only differentially expressed gene 
that was also among the positive Bayescan outliers (con tig 
14541) encoded a DNAJ heat shock N-terminal domain-con- 
taining protein, thought to be involved in response to heat 
stress, which may reflect adaptation of low-altitude S. chry- 
santhemifolius to hotter, more arid conditions at the bottom 
of Mt. Etna. Functional annotation of further differentially 
expressed genes with high F ST (in top 5%) or reduced DNA 
polymorphism (significant ML-HKA) provided no obvious 
connections to adaptations to altitude. 
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Discussion 

During speciation, a heterogeneous pattern of genomic dif- 
ferentiation can build up in response to divergent selection of 
adaptive alleles at a handful of loci, while the rest of the 
genome is largely free to exchange alleles (Wu 2001). This 
heterogeneous pattern may become more prominent if the 
recently diverged species come back into contact and are free 
to exchange alleles throughout the genome, while maintain- 
ing their interspecific differences due to the reduced or pre- 
vented introgression of locally adaptive alleles (Payseur 2010). 
There are only a few examples of analyses of clear-cut eco- 
logical speciation with ongoing gene flow (reviewed in Funk 

2009) , and the majority of research has been carried out on 
animal systems (Nosil 2007; Schwarz et al. 2009; Renaut et al. 

2010) and hardly on plants (however, see e.g., Schemske and 
Bradshaw 1999; Savolainen et al. 2006; Papadopulos et al. 201 1 
for plant systems). Such focus on animals in this field is some- 
what surprising, as the pervasiveness and importance of gene 
flow between closely related species has long been recognized 
in the plant literature (e.g., Anderson 1949; Anderson and 
Stebbins 1954; Stebbins 1959; reviewed in Rieseberg 1995; 
Arnold 1997). 

Ecological Speciation in Senecio 

We undertook a transcriptomewide analysis of polymor- 
phism and divergence in a recently derived pair of plant spe- 
cies, which are thought to have diverged through ecological 
speciation. By comparing genomewide and locus-specific pat- 
terns of sequence variation and gene expression, we can begin 
to understand how these species formed and/or how they are 
being maintained in the face of ongoing interspecific gene 
flow. The study system herein concerns the sister species 
Senecio aethnensis and S. chrysanthemifolius (Asteraceae) 
that form a hybrid zone on Mt. Etna, Sicily. Contrasting en- 
vironments of high and low altitudes create strong divergent 
selective pressures in geographically proximal (within a few 
kilometer) populations of these species. 

Previous demographic inferences indicate a relatively 
recent (~35-150 Ka) split of S. aethnensis and S. chrysanthe- 
mifolius, a smaller population size in the latter species, and 
extensive interspecific gene flow (Muir et al. 2013; Osborne 
et al. 2013). The genomewide data reported in the current 
study is consistent with this scenario, with our estimate of the 
time of species split (~108Ka) falling in between the two 
previous estimates. Our relatively high estimate of interspe- 
cific gene flow demonstrates that the gene pools of the two 
species are actively mixing and that without diversifying se- 
lection at genes involved in adaptation to contrasting envi- 
ronments, the distinctiveness of these species would be wiped 
out by ongoing gene flow (Wright 1931; Slatkin 1985). 

Our estimates of the split time and population sizes are 
based on the assumption of an ancestral population size, N a , 
and assuming smaller or larger values would proportionately 
reduce or increase the estimates. N a = 300,000 was based on 
the previous literature (Muir et al. 2013; Osborne et al. 2013), 
and furthermore, this estimate gives reasonable values for the 
mutation rate, /x, calculated from the scaled mutation rate, 0 



(=4N a /x), estimated in this model. Assuming N a = 300,000 re- 
sults in an estimate of /x (4.27 x 10 -9 ; bootstrap range, BR: 
2.42 x 10 -9 to 7.73 x 10 -9 ), which is roughly compatible 
with an Asteraceae-specifk mutation rate (/x~1x10 -8 ; 
Strasburg and Rieseberg 2008) and a generic plant mutation 
rate (/x ~ 5 x 10~ 9 Wolfe et al. 1987). 

Our demographic inference using dadi confirmed the pre- 
vious conclusion that S. aethnensis has a somewhat larger 
effective population size compared with S. chrysanthemifolius 
(Muir et al. 2013), which is also in line with various summary 
statistics reported here (table 2). In particular, DNA polymor- 
phism in S. aethnensis was significantly higher than in S. chry- 
santhemifolius (figs. 1 and 2), and the former species 
contained less monomorphic loci (4.2%) than the latter 
(8.7%). This may seem somewhat surprising given the wider 
extant distribution of S. chrysanthemifolius compared with 
S. aethnensis, which is restricted to the upper part of Mt. 
Etna; however, the past distributions and population sizes 
of the two species may have been very different. First, the 
split of the ancestral population at time T s gave rise to an 
ancestral S. aethnensis population many times larger than 
that of S. chrysanthemifolius (fig. 7); therefore, it seems likely 
that the now relatively widespread S. chrysanthemifolius orig- 
inated as a small marginal population. Second, because 
S. aethnensis is better adapted to colder conditions, it is 
likely to have been much more widespread during the glacial 
cycles of the Pleistocene, whereas S. chrysanthemifolius may 
have been more restricted in its distribution compared to 
now, as seen in other Sicilian plant taxa (Sadori et al. 2008). 
Since the beginning of the Holocene ~ 12,000 years ago, 
S. aethnensis is likely to have become restricted to its current 
high altitude refuge on Mt. Etna, and S. chrysanthemifolius 
may have been able to colonize a larger area of Sicily. This 
scenario is in line with the overall reduction in population 
size of S. aethnensis and the increase in population size of 
S. chrysanthemifolius since their split around 108,000 years ago 
(fig. 7). 

A Heterogeneous Pattern of Genomic Differentiation 
In addition to the proximity of two species, factors such as 
seed dispersal by wind, the constant availability of empty 
niches due to volcanic activity, and an apparent lack of any 
reproductive barriers (Chapman et al. 2005) should provide 
ample opportunities for interspecific hybridization and 
homogenization of the genomes of S. aethnensis and S. chry- 
santhemifolius. Hybrid populations flourishing at intermedi- 
ate altitudes may also act as a bridge (or rather stepping 
stones) for interspecific gene flow (Brennan et al. 2009). In 
this respect, our findings of relatively low genomewide genetic 
differentiation (only 98 fixed SNPs in ~ 10,000 genes analyzed), 
extensive sharing of SNPs (45% of all SNPs were shared across 
both species) and a high migration parameter (2N a m = 4.94- 
9.64) are not entirely unexpected. In addition, a small number 
of loci potentially show the signature of a global selective 
sweep in which an allele has risen to fixation (or almost so) 
across both species. Morjan and Rieseberg (2004) suggested 
that the collective evolution of species exchanging adaptive 
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alleles may be fairly widespread, but, at the time, molecular 
genetic data from nuclear protein coding genes supporting 
this was lacking. Our data, with caveats (see later), suggest the 
presence of global (transspecifk) sweeps in these two species. 

Despite extensive gene flow, the species, however, main- 
tain their phenotypic (and genotypic; fig. 5) integrity at the 
extremes of the transect, and therefore a portion of the ge- 
nomes is presumably being prevented from introgressing (in 
contrast to the remainder of the genome), due to strong, 
most likely ecological-based, selection. Indeed, this is sup- 
ported by our data, with a subset of loci showing significantly 
elevated differentiation. The pattern of polymorphism across 
the transcriptome sheds light on the process of ecological 
speciation. 

It appears that only a small subset of the genome harbors 
evidence for playing a role in the specific differences between 
these species, with only 90 loci (of ~ 18,800) showing gene 
expression divergence and 199 loci showing significantly 
elevated differentiation (at FDR < 0.01). This fraction of se- 
quence-based outliers (199/8,540; 2.33%) is low compared 
with other studies (reviewed in Strasburg et al. 2012); how- 
ever, comparisons between studies are difficult to make be- 
cause of differences in methodologies and significant cutoffs 
(Nosil et al. 2009). For example, our study utilized coding 
region polymorphisms, whereas most related studies use 
anonymous markers (Strasburg et al. 2012). Similarly, our 
study used a very large number of loci at the expense of 
the number of individuals analyzed. It remains to be seen 
whether other such large-scale investigations of coding 
region polymorphism give similar proportions of outlier loci. 

Evolution of Gene Expression 

The analysis of gene expression under controlled conditions 
revealed a handful of loci differentially expressed between the 
two species, which is consistent with the view that gene 
expression differences can evolve quickly between species 
or differentially adapted populations (Whitehead and 
Crawford 2006; Wolf et al. 2010). Evolution of such loci 
would require fixation of c/s-mutations in the regulatory 
regions and/or evolution of a gene that acts in trans. Under 
c/s-regulatory evolution, one would expect to detect higher 
differentiation as well as signatures of selective sweeps in the 
5' regions of these loci. Because of the nature of RNA-seq, our 
data did not allow us to investigate the upstream regions of 
these loci, and, as mentioned previously, differentially ex- 
pressed genes usually had very low expression in one of the 
species. Despite this, the initial 17 expression outliers with 
sufficient sequence data in both species did show increased 
interspecific differentiation (F ST and D xy ) relative to the 
genome as a whole, and this was corroborated when a further 
nine outlier loci were PCR amplified and sequenced (fig. 4). 
This was also found when six differentially expressed loci 
(identified in Hegarty et al. 2009) were compared with 11 
genes with no evidence of differential expression (Muir 
et al. 2013). This pattern could be due to adaptive divergence 
in a proximal region that is responsible for regulating gene 
expression (promoters or nearby enhancers that were not 



sampled using RNA-seq) and a parallel increase in differenti- 
ation of the coding region due to linkage to the adaptive 
mutations. A subset of the expression candidates did show 
departure from neutrality in the coding regions in the 
ML-HKA test in one or other species, and this ad hoc test 
therefore informs us about which species is likely to have 
experienced the selection for altered expression. 

GO terms related to primary amine oxidase activity and 
magnesium ion binding appear to be overrepresented among 
the differentially expressed genes. It is possible that differential 
metal ion deposition through volcanic activity (e.g., Aiuppa 
et al. 2006) is imposing a selection pressure, although, at this 
stage, it is too early to say that this is the reason for overrep- 
resentation of loci involved in ion binding. 

Only two loci were found to be differentially expressed in 
both our study and a previous microarray analysis of gene 
expression differences between these two species of Senecio 
(Hegarty et al. 2009). This is probably due to differences in 
tissues analyzed and methodology. Hegarty et al. (2009) used 
RNA extracted from capitulum buds and flower buds, 
whereas our analyses used RNA extracted from capitulum 
buds, leaf, and stem tissue; thus, the two analyses are not 
strictly comparable, so it is perhaps not surprising that 
more loci in common were not identified. Further, Hegarty 
et al. did not seek to explore intraspecifk variation in gene 
expression, but rather created an "average" transcriptome by 
pooling RNA from multiple individuals to identify gross spe- 
cies-specific differences between the two species. Such an 
approach without individual-level analysis could lead to the 
identification of false positives, for example, if a gene was very 
actively expressed in only one individual out of the pool, its 
transcript abundance would appear very different in the two 
species. Our NGS RNA-seq analysis overcomes this issue be- 
cause we control for variation in expression among individ- 
uals within the species. Thus, the two loci identified as being 
differentially expressed by both microarray and RNA-seq are 
particularly interesting, together with the other candidate 
transcripts identified by RNA-seq analysis. These 90 candidate 
genes will be the subject of future studies of the evolution of 
differential gene expression in Senecio. 

Caveats and Cautions 

A study of this magnitude may be affected by various types of 
errors, and it is important to consider their sources and pos- 
sible effects on our conclusions. Errors in both sequencing and 
read mapping may result in an excess of low-frequency var- 
iants in the data, which would lead to biases in demographic 
inferences. There is no evidence of an excess of low-frequency 
variants in our data, if anything there may be a slight lack of 
such sites (see supplementary fig. S3, Supplementary Material 
online). If, however, the rare variants were clustered in just a 
few genes, then this pattern of polymorphism may mimic 
recent selective sweeps, which could partly account for the 
signature of transspecifk selective sweeps we detected in the 
data. 

LD is heterogeneous across the genome (e.g., Kim et al. 
2007), and a localized genomic region of increased LD would 
cause hitchhiking of many genes linked to target of selection; 
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however, this is not anticipated to have a major influence on 
the data we present. As we demonstrated earlier, the extent 
of LD in the genomes of these Senecios is low (supplementary 
fig. S2, Supplementary Material online), with LD likely to be 
limited to SNPs within loci and not expected to extend to 
neighboring loci. 

Finally, we acknowledge that transcriptomics only captures 
the transcribed portions of the genome that are expressed at 
a sufficient level in the tissues we chose to analyze. Future 
analyses aimed at analyzing the signature of selection 
throughout the remainder of the genome will shed light on 
the contribution of selection on coding versus regulatory 
DNA during speciation and adaptation. 

Conclusions 

This study represents one of the first transcriptomewide anal- 
yses of adaptation to contrasting environments during 
ecological speciation in plants. Our results suggest that the 
evolution of morphological and ecological differences be- 
tween the two Senecio species on Mt. Etna occurred over a 
relatively short timescale, and it was accompanied by diver- 
sifying or species-specific selection in a small proportion 
(<3%) of the sequenced loci. Selection at this small portion 
of the genome is apparently sufficient for the species to main- 
tain their phenotypic and genotypic differences on Mt. Etna, 
despite a high level of ongoing gene flow that has the poten- 
tial to homogenize the rest of the genome. 

Materials and Methods 

Study Species, RNA Extraction, and Sequencing 
Achenes (single-seeded fruit) of S. aethnensis and S. chry- 
santhemifolius were collected on Mt. Etna from wild plants 
at the extremes of the altitudinal dine to ensure the best 
approximation of species genetic "purity" (table 1). All 
S. aethnensis achenes were collected from above 2,000 m. 
within ~4 km of the summit, and achenes of S. chrysanthe- 
mifolius were taken from a range of populations at the base of 
Mt. Etna (all ~700-800m above sea level [a.s.l.]) ~12-15km 
from the summit to the north, west, south, and northeast. 
Achenes of the outgroups (based on Comes and Abbott 
2001) S. vernalis and S. gallicus, respectively, were supplied 
by the Millennium Seed Bank Partnership (http://data.kew. 
org/seedlist/index.html, last accessed October 9, 2013) and 
collected by S.J. Hiscock. Achenes were germinated on 
damp filter paper and seedlings transferred to a soil/vermic- 
ulite mix in identical-sized pots in a growth room set at 19- 
21 °C with a 16-h photoperiod. Ten plants of each focal spe- 
cies and a single outgroup were selected, and RNA was ex- 
tracted from the plants as the first inflorescence opened. The 
apical tissues (comprising inflorescence, stem, and uppermost 
leaf) were removed from the plants and immediately frozen 
in liquid nitrogen prior to RNA extraction with the RNeasy kit 
(Qiagen, Crawley, UK). The libraries for high-throughput se- 
quencing of the polyA fraction of the transcriptome were 
prepared and sequenced at the Genomics facility at the 
Wellcome Trust Centre for Human Genetics (WTCHG), 
Oxford. All sequencing was conducted on HiSeq2000 



lllumina machines with multiplexing several libraries per 
lane. At least 12 million 50 bp end reads were generated for 
each of the libraries (see supplementary table S2, 
Supplementary Material online, for details). 

Assembly and Annotation 

Our data processing and assembly strategy were similar to 
that we used in our previous work (Chibalina and Filatov 
2011). lllumina reads for each individual transcriptome were 
imported into CLC Genomics Workbench (CLC-GW; ver. 5; 
http://www.clcbio.com, last accessed October 9, 2013), 
trimmed of adapters and filtered for low-quality reads with 
default settings. Because we were interested in differential 
expression, it is possible that some of the most interesting 
genes would be unaccounted for in the sequence reads of a 
single individual (if expression was very low or zero in some 
individuals). We therefore assembled a "hybrid tran- 
scriptome," comprising reads from one individual of each 
species. This transcriptome was only used as a reference for 
mapping short reads from each of the transcriptomes, and it 
was not used in any of the DNA polymorphism/divergence 
analyses described below. This reference Senecio transcrip- 
tome was assembled de novo in CLC-GW with high strin- 
gency to reduce the likelihood of paralogous transcripts 
assembling as a single transcript (mismatch cost 3, insertion 
cost 3, deletion cost 3, minimum identity 0.95). The assembly 
process included a scaffolding step based on paired end reads. 
Contigs less than 500 bp were not considered further. 

The coding regions and associated GO terms for individual 
cDNA contigs were identified using Blast2GO (Conesa et al. 
2005; Gotz et al. 2008). The XML output from Blast2GO was 
imported to ProSeq (Filatov 2009) and assigned to the refer- 
ence transcriptome as well as to all the sequences in the 
alignments (described later). The scaffolding stage of the as- 
sembly resulted in short stretches of unknown sequence in 
places. The length of such regions was estimated by the scaf- 
folding algorithm in CLC-GW based on distance between 
paired reads. Occasionally, these poly(N) regions caused a 
frameshift in the open reading frame (ORF), which was de- 
tected and resolved in ProSeq (Filatov 2009) by adding addi- 
tional Ns to restore the ORF. At this stage, a small number of 
contigs with nonplant top BLAST hits and/or more than one 
ORF were removed, resulting in the final set of reference 
cDNA contigs. The completeness of the reference assembly 
was checked by BlastX comparison of each contig to the 
TAIR10 (The Arabidopsis Information Resource) Arabidopsis 
proteome using BLAST + (ver. BLAST2.2.25 + ; Camacho 
et al. 2009). 

Reads from each of the 20 individuals (plus the outgroup) 
were mapped onto the reference in CLC-GW, one library at a 
time, to produce consensus contigs. The same stringency 
settings were used as for the initial assembly. The files were 
exported from CLC-GW as BAM files and, using a custom perl 
script (available on request), parsed through samtools (Li et al. 
2009) using the commands "mpileup," "vcf2fq," and "fq2fa" to 
align the reads to the reference, create consensus (with het- 
erozygous bases encoded with IUPAC code) for each locus, 
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and to save sequences in fasta format. Minimum base quality 
was set to 20, minimum mapping quality to 5, and minimum 
read depth to 5. The fasta files from samtools were imported 
into Proseq (Filatov 2009), aligned to the reference, and 
coding regions copied across to all alleles. Heterozygous 
bases were phased by exporting each alignment as a *.inp 
file, running each alignment through fastPHASE (Scheet 
and Stephens 2006) using a custom perl script, and importing 
the resultant phased *.out files back into Proseq. 

Population Genetic Analyses and Identification of 
Sequence-Based Outlier Loci 

To investigate the clustering of genetic variation between 
individuals, we concatenated all SNPs from the unphased 
data that were present in all 20 individuals. The program 
STRUCTURE (Falush et al. 2003) was used to assess the 
number of genetic clusters (K) apparent in the data. Four 
runs for each K were carried out for K = 1-6, and the Evanno 
et al. (2005) method (implemented in Structure Harvester, 
Earl and von Holdt 2012) was used to determine the max- 
imal value of delta(K) and hence the most likely value of K. 

To determine whether LD between loci was likely to be 
affecting locus-specific patterns of polymorphism and diver- 
gence due to background selection, we investigated the rela- 
tionship between inter-SNP distance and LD. In the absence 
of the genome sequence, we were limited to intronless data, 
so the distances between SNPs are underestimates in most 
cases, as the length of introns is not taken into account (see 
Results), x 2 tests °f independence were carried out for all 
pairs of SNPs in loci with at least 1,000 gap-free alignment 
sites (n = 1,817 loci) and binned according to distance be- 
tween SNPs. Within each bin, we then counted the propor- 
tion of significant (P< 0.001) x tests (where a significant 
value indicated departure from linkage equilibrium; supple- 
mentary fig. S2, Supplementary Material online). 

For the population genetic analyses, sites with missing data 
were excluded, as were loci with less than 28 S. aethnensis and 
S. chrysanthemifolius alleles (i.e., 14 individuals). Transcripto- 
mewide distributions of Nei and Li's 7t (Nei and Li 1979), a 
measure of nucleotide diversity, and Tajima's D (Tajima 1989), 
a measure of neutrality, were calculated in Proseq (Filatov 
2009) and compared between species. Wright's F ST (Wright 
1951), a measure of genetic differentiation, and absolute and 
net divergence (D xy and D a ; (Nei 1987) were calculated be- 
tween S. aethnensis and S. chrysanthemifolius also using 
Proseq. Fay and Wu's H (Fay and Wu 2000) was calculated 
in DnaSP (Rozas et al. 2003). Monomorphic loci were neces- 
sarily excluded from the calculations of Tajima's D, F ST and Fay 
and Wu's H. In addition, because short alignments could 
result in erroneous calculations of most population genetic 
parameters, we primarily present data from the analysis of 
only those loci with >100 sites available for analysis. 

For demographic inference and visualization of two- 
dimensional SFS (2D-SFS, see fig. 6), we used the dadi package 
(Gutenkunst et al. 2009). To polarize the SNPs, we established 
the ancestral state for each SNP using two outgroups, 
S. vernalis and S. gallicus. These species' transcriptomes 



were sequenced in the same way as the other samples and 
were aligned to the reference and SNPs scored using the same 
settings. Only the SNPs where both outgroups contained the 
same nucleotide were used in the analysis. These outgroup 
species are quite closely related to S. aethnensis and S. chry- 
santhemifolius (maximum divergence at silent sites is ~5%), so 
we consider repeated mutations at the same site in both 
outgroups unlikely. To minimize the effects of selection on 
the demographic inference, only the SNPs at 4-fold degener- 
ate sites were analyzed. Just loci with >500 sites without 
missing data were used (n = 4,851 loci). A modified version 
of Proseq was used to convert SNP data to the format suitable 
for use in the dadi package. 

We tried several demographic scenarios implemented in 
dadi and focused on the isolation-with-migration model with 
population size change (IMpre). The parameters of the model 
are described in the Results, and the most likely inferred sce- 
nario for this model is shown in figure 7. This model was fit to 
our data with all eight parameters s, N v N 2 , M^ 2 > 

and M 2 ^y fig. 7) estimated simultaneously, as well as in runs 
with the migration parameters fixed at zero. Likelihood ratio 
tests were used to determine whether nonzero migration 
parameters improved model fit to data. Repeated runs with 
perturbed starting parameters (perturb jparams function in 
dadi) were used to ensure we found the global maximum. To 
evaluate the robustness of the parameter estimates, we used 
the sample method to create 300 bootstrap replicates of the 
observed SFS. The model was fitted to each of these replicates 
and the parameters were estimated (fig. 7). 

To identify the loci putatively involved in species differen- 
tiation, we used BayeScan (Fischer et al. 2011; Foil and 
Gaggiotti 2008) with loci for which sequences from all 20 
individuals and >100 gap-free alignment positions were pre- 
sent. Frequencies of each haplotype per locus were calculated 
from Proseq outputs and the program run with 150,000 iter- 
ations, discarding the first 50,000 as burn in. Outlier loci based 
on a 1% and 5% FDR were identified. 

Identifying Candidate Genes Based on Expression 
CLC-GW was used to calculate RPKM (Mortazavi et al. 2008) 
for each contig in each S. aethnensis and S. chrysanthemifolius 
individual as a measure of gene expression. This metric is 
standardized by length of the contig across all individuals 
and by total read number for that individual. Values were 
transformed by adding 0.0001 to each score (such that tran- 
scripts undetected in some individuals could still be statisti- 
cally tested for expression differences) and quantile 
normalized. Significant differences in expression between 
the two species were identified using t-tests with an FDR of 
0.05 in CLC-GW. In addition, RPKM values were analyzed in 
NOISeq (Tarazona et al. 2011), a nonparametric approach to 
identify expression differences, using default parameters. 
NOISeq takes into account the noise distribution of the 
actual data (not just the locus in question) and has been 
shown to be effective in controlling the number of false pos- 
itives detected (Tarazona et al. 201 1 ). We compared the list of 
expression candidates with those identified in a previous 
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microarray study of ~ 3,000 transcripts (Hegarty et al. 2009). 
Of the 224 differentially expressed probes identified by 
Hegarty et al. (2009), 210 were recovered from Cenbank 
(www.ncbi.nlm.nih.gov/genbank/, last accessed October 9, 
2013) and of these 128 were also present in our investigation, 
corresponding to 100 different contigs (i.e., some of the 
Hegarty et al. clones are different portions of one transcript 
according to our assembly). We also PCR amplified and 
Sanger-sequenced (using established protocols, e.g., 
Chapman et al. 2008; primers are listed in supplementary 
table S3, Supplementary Material online) a subset of the ex- 
pression candidates from genomic DNA from the same indi- 
viduals for a downstream analysis of F ST in this subset of loci 
(see Results). 

Analysis of Candidate Genes 

Having identified candidate loci using the above tests, we 
carried out Fisher's exact tests in Blast2GO to determine 
whether certain GO terms were overrepresented, relative to 
the balance of the transcriptome, and hence whether the 
signature of selection was more pronounced across loci 
that share pathways or functions in common. Where only a 
subset of the loci had been included in the test, this subset 
was used as the reference set in the GO analysis. Separate tests 
were carried out for two groups of outliers: 1) BayeScan can- 
didates and 2) expression candidates (outliers from the t-test 
and NOISeq analyses combined). 

ML-HKA (Hudson et al. 1987) tests (Wright and 
Charlesworth 2004) were carried out in Proseq for various 
subsets of candidate loci, to try and identify if the pattern 
of polymorphism was indicative of selection at these loci and 
in which species there was evidence for selection. To do this, 
the sequence of each candidate locus from one species at a 
time was compared with 20 randomly chosen putatively neu- 
tral loci, using the S. vernalis transcriptome to calculate inter- 
specific divergence. A singleton SNP was added randomly to 
monomorphic loci because ML-HKA cannot run if 0 - 0. A 
strictly neutral model was run and then compared to a model 
in which the candidate was deemed under selection (100,000 
iterations each). Twice the difference in the likelihoods of the 
two models was calculated and tested for significance using a 
X 2 distribution with one degree of freedom (Wright and 
Charlesworth 2004). 

Supplementary Material 

Supplementary figures S1 -S3 and tables S1 -S3 are available at 
Molecular Biology and Evolution online (http://www.mbe. 
oxfordjournals.org/). 
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